Wiktionary for Natural Language Processing: Methodology and Limitations

Authors

  • Emmanuel Navarro
  • Franck Sajous
  • Bruno Gaume
  • Laurent Prévot
  • Shu-Kai Hsieh
  • Ivy Kuo
  • Pierre Magistry
  • Chu-Ren Huang
Abstract

Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. However, it must be processed before it can be used efficiently as an NLP resource. After describing the aspects of Wiktionary relevant to our purposes, we focus on its structural properties. Then, we describe how we extracted synonymy networks from this resource. We provide an in-depth study of these synonymy networks and compare them to those extracted from traditional resources. Finally, we describe two methods for semi-automatically improving this network by adding missing relations: (i) using a kind of semantic proximity measure; (ii) using the translation relations of Wiktionary itself. Note: The experiments in this paper are based on Wiktionary dumps downloaded in 2008. Results may differ with the versions currently available online.
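The abstract does not detail how translation relations are used to suggest missing synonymy links, so the following is only a minimal sketch of one plausible reading: two words that are not yet linked but share several translations are proposed as candidate synonyms for a human to validate. The function names, the `min_shared` threshold, and the toy data are all hypothetical and do not come from the paper.

```python
from itertools import combinations


def build_synonym_graph(pairs):
    """Build an undirected synonymy graph as a dict: word -> set of neighbours."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def candidates_from_translations(graph, translations, min_shared=2):
    """Propose unlinked word pairs sharing at least `min_shared` translations.

    `translations` maps a word to a set of (language, translation) tuples.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    proposals = []
    for a, b in combinations(sorted(translations), 2):
        if b in graph.get(a, set()):
            continue  # already linked as synonyms, nothing to propose
        shared = translations[a] & translations[b]
        if len(shared) >= min_shared:
            proposals.append((a, b, len(shared)))
    # Strongest evidence first, then alphabetical for a stable order
    proposals.sort(key=lambda p: (-p[2], p[0], p[1]))
    return proposals


# Toy data invented for illustration (not extracted from Wiktionary)
synonym_pairs = [("car", "automobile"), ("big", "large")]
translations = {
    "car": {("fr", "voiture"), ("de", "Auto"), ("es", "coche")},
    "automobile": {("fr", "voiture"), ("de", "Auto")},
    "auto": {("fr", "voiture"), ("de", "Auto")},
    "big": {("fr", "grand"), ("de", "groß")},
    "large": {("fr", "grand"), ("de", "groß")},
    "huge": {("fr", "énorme")},
}

graph = build_synonym_graph(synonym_pairs)
proposals = candidates_from_translations(graph, translations)
for a, b, n in proposals:
    print(f"candidate: {a} ~ {b} ({n} shared translations)")
```

The output is a ranked list of candidates rather than a set of edges added directly, which matches the semi-automatic setting: a contributor would accept or reject each proposal before it enters the network.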


Similar resources

IWNLP: Inverse Wiktionary for Natural Language Processing

Nowadays, there are a lot of natural language processing pipelines that are based on training data created by a few experts. This paper examines how the proliferation of the internet and its collaborative application possibilities can be practically used for NLP. For that purpose, we examine how the German version of Wiktionary can be used for a lemmatization task. We introduce IWNLP, an openso...


Salehi, Bahar, Paul Cook and Timothy Baldwin (to appear) Detecting Non-compositional MWE Components using Wiktionary, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar

We propose a simple unsupervised approach to detecting non-compositional components in multiword expressions based on Wiktionary. The approach makes use of the definitions, synonyms and translations in Wiktionary, and is applicable to any type of MWE in any language, assuming the MWE is contained in Wiktionary. Our experiments show that the proposed approach achieves higher F-score than state-o...


Dbnary: Wiktionary as a LMF based Multilingual RDF network

Contributive resources, such as Wikipedia, have proved to be valuable in Natural Language Processing or Multilingual Information Retrieval applications. This article focusses on Wiktionary, the dictionary part of the collaborative resources sponsored by the Wikimedia


The comparison of Wiktionary thesauri transformed into the machine-readable format

St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (http://code.google.com/p/wikokit/). Wiktionary is a unique, peculiar, valuable and original resource for natural language processing (NLP). The paper describes an open-source Wiktionary parser: its architectur...


Extracting Lexical-Semantic Knowledge from the Portuguese Wiktionary

Public domain collaborative resources like Wiktionary and Wikipedia have recently become attractive sources for information extraction. To use these resources in natural language processing (NLP) tasks, efficient programmatic access to their contents is required. In this work, we have extracted semantic relations automatically from the Portuguese Wiktionary and compared our results with the re...




Publication date: 2009